Preface


This  technical  report  relates  to  a  previous  work  that  explores  the  application  of 
categorical  and  object-based  verification  methods  to  verify  spatial  forecasts 
produced  by  the  Weather  Running  Estimate-Nowcast  (WRE-N)  of  continuous 
meteorological  variables  that  have  been  filtered  by  a  single  threshold.  These 
methods  use  gridded  forecasts  and  observations  on  a  common  grid,  which  enables 
the application of a number of different spatial verification methods that reveal various
aspects  of  model  performance.  This  report  describes  the  results  obtained  when  the 
same  categorical  method,  called  “spatial  categorical”  in  this  report,  was  applied  to 
the  same  data  to  determine  the  ability  of  the  WRE-N  to  predict  objects  defined  by 
multiple  thresholds.  Thus,  portions  of  this  report’s  content  originated  in 
ARL-TR-7751.¹


¹ Raby JW, Cai H. Verification of spatial forecasts of continuous meteorological variables using
categorical and object-based methods. White Sands Missile Range (NM): Army Research
Laboratory (US); 2016 Aug. Report No.: ARL-TR-7751.

Executive Summary


Spatial  forecasts  from  Numerical  Weather  Prediction  (NWP)  models  of  tactically 
significant  meteorological  variables  to  support  US  Army  operations  on  the 
battlefield  have  become  an  integral  part  of  the  products  available  for  the  Air  Force 
Staff  Weather  Officer  to  use  in  providing  mission  planning  and  execution  forecasts. 
These  forecasts  are  ingested  by  certain  Army  tactical  decision  aids  (TDAs).  Such 
TDAs  fuse  information  on  the  characteristic  operational  weather  thresholds  that 
potentially  affect  (impact)  missions  and  performance  of  systems  conducting  the 
missions  with  the  spatial  forecast  information  from  NWPs.  The  TDA  generates 
spatial  forecasts  of  these  impacts  for  user-specified  systems  and/or  missions  and 
for the time period and location of interest.¹ This report presents the results obtained
by  applying  a  spatial-categorical  method  that  can  verify  spatial  forecast  fields  of 
meteorological  variables  that  have  been  filtered  by  the  application  of  a  threshold  or 
category  the  same  way  as  that  used  by  the  TDA.  In  effect,  a  threshold  applied  to  a 
continuous  variable  field  becomes  a  categorical  forecast  for  which  there  are 
traditional  and  nontraditional  methods  for  verification.  This  study  evaluates  the 
ability  of  the  NWP  model  to  predict  multiple  categories  of  the  spatial  variable  and 
compares  the  skill  of  the  model  for  the  different  categories. 

Traditional methods have been developed to verify the skill of NWP to predict
categories  of  continuous  meteorological  variables.  These  methods  apply  the 
established  theoretical  framework  for  evaluating  deterministic  binary  forecasts. 
This  framework  involves  defining  a  binary  event  through  the  application  of  a 
category  or  threshold  and  evaluates  the  forecast  skill  by  counting  the  numbers  of 
times  the  event  was  forecast  or  not  and  observed  or  not  in  a  contingency  table. 
There  are  numerous  statistics  and  skill  scores  that  can  be  computed  from  the  data 
collected by this method. For this study, the author obtained forecasts from the
Army’s Weather Running Estimate-Nowcast, which is an Advanced Research
version of the Weather Research and Forecasting Model adapted for generating
short-range nowcasts. Gridded observations were produced by the National
Oceanic and Atmospheric Administration’s Global Systems Division using
the Local Analysis and Prediction System. A tool developed by the National Center
for Atmospheric Research called MET Series-Analysis was used to compute the
skill scores and statistics at every grid point; graphical products were then generated to
display the spatial distribution of the scores and statistics for each of 4 categories.


¹ Johnson J. Personal communication. White Sands Missile Range (NM): Army Research
Laboratory (US); 2017 June 17.

Preliminary results suggest the skill of the model when predicting objects defined
by lower thresholds is greater than the skill for objects defined by higher thresholds.

1.  Introduction and Background


As  computing  technology  has  advanced,  the  weather-forecasting  task,  once  the 
primary  role  of  a  human  forecaster  in  theater,  has  shifted  to  computerized 
Numerical  Weather  Prediction  (NWP)  models.  Scientists  around  the  world  have 
used  the  Weather  Research  and  Forecasting  model  (WRF)  extensively  for  many 
applications.  In  this  study,  the  model  used  was  the  Advanced  Research  version  of 
WRF  (Skamarock  et  al.  2008)  that  we  abbreviate  as  WRF-ARW.  WRF-ARW 
includes Four-Dimensional Data Assimilation (FDDA) techniques that can be used
to  incorporate  observations  into  the  model  so  that  forecast  quality  is  improved 
(Stauffer  and  Seaman  1994;  Deng  et  al.  2009).  The  US  Army  Research  Laboratory 
(ARL)  uses  WRF-ARW  as  the  core  of  its  Weather  Running  Estimate-Nowcast 
(WRE-N)  weather-forecasting  model. 

The  Army  requires  high-resolution  weather  forecasting  to  model  atmospheric 
features  with  wavelengths  on  the  order  of  5  km  or  less;  that  imposes  a  requirement 
for  NWP  to  operate  on  a  model  grid  spacing  on  the  order  of  1  km  or  less  in  the 
finest,  or  most  resolved,  domain  to  resolve  weather  phenomena  of  interest  to  the 
Soldier  in  theater.  The  atmospheric  flows  of  interest  to  the  Army  include 
mountain/valley  breezes,  sea  breezes,  and  other  flows  induced  by  differences  in 
land-surface  characteristics.  High-resolution  NWP  forecasts  need  to  be  validated 
against  observations  before  their  outputs  can  be  used  effectively  by  My  Weather 
Impacts  Decision  Aid  (MyWIDA),  an  Army-developed  decision  aid  used  to 
determine  atmospheric  impacts  on  Army  and  Joint  systems  and  operations  (Brandt 
et  al.  2013).  Weather-forecast  validation  has  always  been  of  interest  to  the  civilian 
and  military  weather-forecasting  community;  see,  for  example,  the  reviews  by 
Ebert  et  al.  (2013)  and  Casati  et  al.  (2008)  or  the  books  by  Jolliffe  and  Stephenson 
(2012)  or  Wilks  (2011).  The  validation  of  the  models,  especially  high-resolution 
NWP,  has  proven  to  be  especially  difficult  when  addressing  small  temporal  and 
spatial  scales  (NRC  2010)  that  characterize  NWP  for  use  in  Army  applications. 
Furthermore,  the  verification  of  WRE-N  spatial  fields  of  continuous 
meteorological  variables  that  have  been  filtered  by  the  application  of  a  threshold 
has  not  been  accomplished. 

The  WRF  model  is  maintained  by  the  National  Center  for  Atmospheric  Research 
(NCAR),  which  has  also  developed  a  suite  of  Model  Evaluation  Tools  (MET) 
(NCAR  2013)  to  evaluate  WRF-ARW  performance.  MET  was  developed  at 
NCAR  through  a  grant  from  the  US  Air  Force  557th  Weather  Wing  (formerly  the 
Air  Force  Weather  Agency).  NCAR  is  sponsored  by  the  National  Science 
Foundation. MET Series-Analysis performs spatial-categorical verification of
gridded model output against observations that have been analyzed and placed on
a  grid  matching  that  of  the  model. 

ARL  has  employed  MET  Series-Analysis  in  a  prior  study,  the  results  of  which  are 
presented  by  Raby  and  Cai  (2016).  They  evaluated  the  applicability  of  a 
combination  of  a  categorical  and  object-based  technique  for  assessing  the  1.75-km 
grid  spacing  WRE-N  model  to  demonstrate  the  utility  of  combining  traditional  and 
nontraditional  techniques  for  assessing  the  ability  of  the  model  to  predict  objects 
defined  by  application  of  a  single  threshold. 

ARL’s collaborations with the National Oceanic and Atmospheric Administration’s
(NOAA’s)  Global  Systems  Division  (GSD)  resulted  in  the  generation  of  1.75-km 
grids  of  observations  of  surface  meteorological  variables  for  the  same  domain  as 
the  WRE-N  using  the  NOAA-GSD  Local  Analysis  and  Prediction  System  (LAPS). 

The  WRE-N  was  run  with  and  without  FDDA  for  5  case-study  days  over  a 
1.75-km grid-spacing domain in Southern California over highly varied terrain and
with  a  dense  observational  network  that  provided  a  robust  data  set  of  model  output 
for  analysis.  Since  results  from  a  comparison  of  the  verification  skill  scores  for  the 
FDDA  runs  with  those  run  without  the  FDDA  showed  nearly  identical  scores  (Raby 
and  Cai  2016),  only  the  model  runs  with  FDDA  were  used  for  this  study.  The 
case-study  days  from  February-March  2012  were  picked  to  vary  weather 
conditions  from  a  strong  synoptic  forcing  situation  to  a  quiescent  situation.  (The 
weather  conditions  for  each  study  day  are  described  in  Section  2.3.) 

This  study  employs  MET  Series-Analysis  to  generate  spatial-categorical- 
verification  results  for  assessing  the  WRE-N  at  tactically  significant  grid  spacings 
for  a  range  of  threshold  values  applied  to  forecasts  of  continuous  meteorological 
variables.  The  motivation  for  presenting  results  at  multiple  thresholds  came  from  a 
suggestion  by  a  colleague  who  posed  a  question  about  the  performance  of  the  model 
at  lower  thresholds  in  view  of  lower  skill  when  predicting  objects  defined  by  the 
highest  threshold  (Jameson  2016).  The  skill  scores  generated  at  a  given  threshold 
provide  an  assessment  of  the  ability  of  the  model  to  predict  the  object  defined  by 
the  threshold  similar  to  the  way  MyWIDA  uses  output  from  models  such  as 
WRE-N  to  provide  spatial  distributions  of  forecast  weather  impacts  to  Army 
missions  and  systems.  By  design  and  intent,  Army  systems  and  missions  must  be 
able  to  operate  in  all  weather  conditions,  but  there  are  rules  that  define  marginal 
and  unfavorable  conditions  in  terms  of  numerous  meteorological  variables  that  are 
intended  to  serve  as  a  general  guide  for  decision-makers  to  consider  before  planning 
or executing an operation. For unfavorable impacts due to a single variable,
MyWIDA typically uses a single threshold, either “greater than or equal” (GE) or “less
than or equal”, for a given variable based on the rules that define the unfavorable
weather impacts on systems or missions. Unfavorable conditions are usually
associated  with  the  most  extreme  condition  that  adversely  impacts  the  system  or 
mission. 

2.  Domain  and  Model 


The ARL WRE-N (Dumais et al. 2004; Dumais et al. 2013) has been designed as
a convection-allowing application of the WRF-ARW model (Skamarock et al.
2008) with an observation-nudging FDDA option (Liu et al. 2005; Deng et al.
2009). For this investigation, the WRE-N was configured to run over a multinest
set of domains to produce a fine inner mesh with 1.75-km grid spacing, and it
leveraged an external global model for cold-start initial conditions and
time-dependent lateral boundary conditions for the outermost nest. Table 1 describes the
dimensions for the triple-nested domain. The global model used for ARL development
and testing has been the National Centers for Environmental Prediction’s Global
Forecast System (GFS) model (EMC 2003). The WRE-N is envisioned to be a
rapid-update cycling application of WRF-ARW with FDDA and optimally could
refresh itself at intervals up to hourly (dependent upon the observation network)
(Dumais et al. 2012; Dumais and Reen 2013).


Table 1  WRE-N triple-nested domain dimensions in km

East-West dimension    North-South dimension    Grid spacing
1780                   1780                     15.75
761                    761                      5.25
506                    506                      1.75

For  this  study,  the  model  runs  had  a  base  time  of  1200  coordinated  universal  time 
(UTC)  and  produced  output  for  each  hour  from  1200  UTC  to  0600  UTC  of  the 
following  day  for  a  total  of  19  hourly  model  outputs,  which  were  produced  for  each 
of  5  days  in  February  and  March  2012.  The  modeling  domains  are  depicted  in 
Fig. 1.

Fig. 1  Triple-nested model domains; domain center points are coincident and centered
near San Diego, California (Google Earth 2016)
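
The hourly output schedule described above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `hourly_valid_times` is hypothetical, and the date used is the first case-study day (7 February 2012) with the 1200 UTC base time stated in the text.

```python
from datetime import datetime, timedelta

def hourly_valid_times(base_time, n_hours):
    """Return hourly valid times from base_time through base_time + n_hours, inclusive."""
    return [base_time + timedelta(hours=h) for h in range(n_hours + 1)]

# Base time of 1200 UTC on 7 February 2012; 0600 UTC the next day is 18 h later.
times = hourly_valid_times(datetime(2012, 2, 7, 12), 18)
print(len(times), times[0], times[-1])
```

Counting both end points, the 18-h integration yields the 19 hourly outputs quoted in the text.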

2.1  Observations  for  Assimilation 

The  initial  conditions  were  constructed  by  starting  with  the  GFS  data  as  the  first 
guess  for  an  analysis  using  observations.  Most  observations  were  obtained  from  the 
Meteorological  Assimilation  Data  Ingest  System  (MADIS)  (NOAA  2016),  except 
for  the  Tropospheric  Airborne  Meteorological  Data  Reporting  (TAMDAR) 
(Daniels  et  al.  2016)  observations,  which  were  obtained  from  AirDat,  LLC.  The 
MADIS database included standard surface observations, mesonet surface
observations,  maritime  surface  observations,  wind-profiler  measurements, 
rawinsonde  soundings,  and  Aircraft  Communications,  Addressing,  and  Reporting 
System (ACARS) data. Use and reject lists were obtained from developers of the
Real-Time Mesoscale Analysis (RTMA) system (De Pondeca et al. 2011), and these were used to filter MADIS
mesonet  observations.  This  quality-assurance  evaluation  is  especially  important 
given  the  greater  tendency  of  mesonet  observations  to  be  more  poorly  sited  than 
other,  more  standard,  surface  observations. 

The  Obsgrid  component  of  WRF  was  used  for  quality  control  of  all  observations. 
This included gross-error checks, comparison of observations to a background field
(here GFS), and comparison of observations to nearby observations. We modified
Obsgrid to allow observations such as the TAMDAR and ACARS data to be more
effectively compared against the GFS background field. The quality-controlled
observations were output in hourly, “little_r” formatted text files for use as
ground-truth data for model assessment. We employed observation nudging to the
observations from these same sources for the preforecast period of 1200-1800 UTC
(0- through 6-h lead times), followed by a 1-h ramping down of the nudging from
1800 to 1900 UTC, during which no new observations are assimilated. The true,
free forecast period thus begins at 1800 UTC because no observations after this
time are assimilated.
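
The nudging schedule described above can be pictured as a time-dependent weight. The sketch below is an assumption made for illustration, not the WRE-N implementation: the function name `nudging_time_weight` is hypothetical, and the linear shape of the ramp is assumed because the text gives only its start (360 min after the base time) and duration (1 h).

```python
def nudging_time_weight(minutes_since_base, end_fdda_min=360.0, ramp_min=60.0):
    """Hypothetical nudging weight: 1 during the preforecast period, then an
    assumed linear decay to 0 over ramp_min minutes."""
    if minutes_since_base <= end_fdda_min:
        return 1.0  # 1200-1800 UTC: full-strength nudging
    if minutes_since_base >= end_fdda_min + ramp_min:
        return 0.0  # after 1900 UTC: no nudging
    # 1800-1900 UTC: ramp down (linear shape is an assumption)
    return 1.0 - (minutes_since_base - end_fdda_min) / ramp_min

for minutes in (0, 360, 390, 420):
    print(minutes, nudging_time_weight(minutes))
```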

2.2  Parameterizations 

For  the  parameterization  of  turbulence  in  WRE-N,  a  modified  version  of  the 
Mellor-Yamada-Janjic  (MYJ)  Planetary  Boundary  Layer  (PBL)  (Janjic  1994) 
scheme  was  used.  This  modification  decreases  the  background  turbulent  kinetic 
energy and alters the diagnosis of the boundary-layer depth used for model output
and  data  assimilation  (Reen  et  al.  2014).  The  WRF  single-moment,  5-class 
microphysics  parameterization  is  used  on  all  domains  (Hong  et  al.  2004),  while  the 
Kain-Fritsch  (Kain  2004)  cumulus  parameterization  is  used  only  on  the  15.75-km 
outer  domain.  For  radiation,  the  Rapid  Radiative  Transfer  Model  (RRTM) 
parameterization  (Mlawer  et  al.  1997)  is  used  for  longwave  radiation  and  the 
Dudhia  (1989)  scheme  for  shortwave  radiation.  The  Noah  land-surface  model 
(Chen  and  Dudhia  2001a,  2001b)  is  used.  Additional  references  and  other  details 
for  these  parameterization  schemes  are  available  from  Skamarock  et  al.  (2008). 
Table 2 lists the WRF configuration settings.

Table 2  WRE-N configuration

Configuration
WRF-ARW V3.4.1                            Yes
Obs-nudging FDDA                          Yes
Multinest (15.75/5.25/1.75 km)            Yes
MADIS observations (FDDA)                 Yes
TAMDAR observations (FDDA)                Yes
Ship/buoy observations (FDDA)             Yes
Filter obs (use/reject) (FDDA)            Yes
RUNWPSPLUS quality control (FDDA)         Yes
Obs-nudge rad 120,60,20                   Yes
MYJ-PBL scheme (modified)                 Yes
WRF sgl-moment, 5-class microphysics      Yes
Option 8: microphysics                    Yes
End FDDA 360 min                          Yes
Kain-Fritsch Cum Param (outer domain)     Yes
RRTM long-wave rad (Mlawer)               Yes
Shortwave rad (Dudhia)                    Yes
Noah land-surface model                   Yes
Fix for nudge to low water vapor          Yes
Model top 10 hPa                          Yes
Feedback on                               Yes
Obs weighting function 4E-4               Yes
57 vertical levels                        Yes
48-s time step                            Yes


2.3  Case-Study  Days 


The  case-study  days  were  selected  on  the  basis  of  the  prevailing  synoptic  weather 
conditions  over  the  nested  domains.  Table  3  provides  a  short  description  of  these 
conditions. 

Table 3  Synoptic conditions for the case-study days considered

Case  Dates (all 2012)  Description
1     February 07-08    Upper-level trough moved onshore, which led to widespread
                        precipitation in the region.
2     February 09-10    Quiescent weather was in place with a 500-hPa ridge centered
                        over central California at 1200 UTC.
3     February 16-17    An upper-level low located near the California-Arizona border
                        with Mexico at 1200 UTC brought precipitation to that portion
                        of the domain. This pattern moved south and east over the
                        course of the day.
4     March 01-02       A weak shortwave trough resulted in precipitation in northern
                        California at the beginning of the period that spread to Nevada,
                        then moved southward and decreased in coverage.
5     March 05-06       Widespread high-level cloudiness due to weak upper-level low
                        pressure but very limited precipitation.

2.4  Observations for Verification


The  LAPS  gridded  observation  data  sets  produced  by  NOAA-GSD  consisted  of  12 
hourly  Gridded  Binary  format,  edition  2  (GRIB2)  files  of  2-m  above-ground-level 
(AGL)  temperature  (TMP),  relative  humidity  (RH),  and  dew-point  temperature 
(DPT)  and  10-m  AGL  U-component  and  V-component  winds  for  the  period  of 
1200-2300  UTC  (forecast  lead  times  0  through  11)  on  each  of  the  5  cases.  The 
output  grid  used  by  the  LAPS  was  289  x  289  with  1.75-km  grid  spacing. 

3.  Data  Preparation  Using  MET 

The  model  and  observational  data  were  preprocessed  into  the  formats  required  by 
MET  Series-Analysis.  The  WRE-N  model  output  data  were  converted  from  native 
Network  Common  Data  Form  (NetCDF)  files  to  hourly  Gridded  Binary  format, 
edition  1  (GRIB)  files  by  the  WRF  Unified  Post  Processor,  which  destaggers  the 
data onto an Arakawa-A Grid containing 288 x 288 grid points. The hourly LAPS GRIB2
files on a 289 x 289 grid had to be remapped to the 288 x 288 grid to match
the WRE-N grid. The NCAR “COPYGB” utility program was used to remap the
observations and convert the files to GRIB (DTC 2016). The author used MET
Series-Analysis to generate the grid-to-grid, categorical-error statistics for surface
meteorological variables TMP and DPT in kelvins (K), RH (%), and wind
speed (WIND) in meters per second for every grid point in the model domain to
provide  a  way  to  see  the  spatial  distribution  of  the  errors.  Series-Analysis  computed 
the  contingency-table  statistics  and  skill  scores  for  each  forecast  hour  for  5  different 
thresholds  (categories)  at  every  grid  point  over  all  12  forecast  lead  times  and  all  5 
case-study  days.  The  thresholds  were  specified  using  the  FORTRAN  convention  of 
“GE”  to  indicate  greater  than  or  equal  to  the  given  threshold  value  and  are  shown 
in  Table  4. 


Table 4  Thresholds used in MET Series-Analysis

TMP (K)    DPT (K)    RH (%)    WIND (m/s)
270        262        25        2
275        267        40        5
280        272        55        8
285        277        70        11
290        282        85        14
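
To make the “GE” convention concrete, the sketch below applies the TMP thresholds of Table 4 to a small pair of temperature fields. The field values are made up for illustration (they are not WRE-N or LAPS data), and the helper name `ge_event` is hypothetical; Series-Analysis performs the equivalent filtering internally.

```python
import numpy as np

# Illustrative 2-m temperatures in kelvins (made-up values, not WRE-N or LAPS data)
forecast = np.array([[268.4, 276.1], [284.9, 291.3]])
observed = np.array([[271.0, 274.2], [285.0, 289.7]])

def ge_event(field, threshold):
    """Binary event mask: True where the field is greater than or equal to the threshold."""
    return field >= threshold

tmp_thresholds_k = [270, 275, 280, 285, 290]  # TMP column of Table 4
fcst_events = {t: ge_event(forecast, t) for t in tmp_thresholds_k}
obs_events = {t: ge_event(observed, t) for t in tmp_thresholds_k}

print(fcst_events[285])
print(obs_events[285])
```

Each threshold turns the continuous field into a yes/no field, so the same forecast can score differently at different thresholds. At 285 K, for instance, the lower-left point is observed (285.0 K) but not forecast (284.9 K).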

MET Series-Analysis generates many categorical skill scores and
contingency-table statistics. Of these, Table 5 lists those that were output
initially.

Table 5  Initial Series-Analysis skill scores and contingency-table statistics

Score/statistic    Description
BASER              base rate
FMEAN              mean forecast value
PODY               hit rate
FAR                false-alarm ratio
FBIAS              frequency bias
CSI                Critical Success Index
GSS                Gilbert Skill Score
ACC                accuracy

For  this  study,  the  author  reduced  the  analysis  to  consider  only  CSI  and  FBIAS  for 
the  variables  of  2-m  AGL  TMP  and  RH  and  10-m  AGL  WIND  to  accomplish  a 
preliminary  assessment  of  the  accuracy  of  WRE-N  output  that  was  filtered  by 
application  of  multiple  thresholds.  Table  6  shows  the  variables  and  thresholds  used 
in  the  analysis.  The  Series-Analysis  output  NetCDF  file  was  ingested  into  the 
Unidata  Integrated  Data  Viewer,  which  was  used  to  generate  graphics  displaying 
the  spatial  distribution  of  the  CSI  and  FBIAS  over  the  WRE-N  domain  (Murray  et 
al.  2003). 


Table 6  Analysis thresholds

TMP (K)    RH (%)    WIND (m/s)
290        85        11
285        70        8
280        55        5
275        40        2

4.  Analysis  of  MET  Series-Analysis  Results 


The  CSI  and  FBIAS  are  defined  by  a  ratio  of  counts  determined  using  a  2  x  2 
contingency  table.  Table  7  shows  the  contingency  table  with  notation  consistent 
with  the  formulae  for  the  scores  and  statistics  as  implemented  in  the  MET  (NCAR 
2013).

Table 7  2 x 2 contingency table from the MET User’s Guide 4.1 (NCAR 2013)

Forecast               Observation                                  Total
                       o = 1 (e.g., “Yes”)    o = 0 (e.g., “No”)
F = 1 (e.g., “Yes”)    n11                    n10                   n1. = n11 + n10
F = 0 (e.g., “No”)     n01                    n00                   n0. = n01 + n00
Total                  n.1 = n11 + n01        n.0 = n10 + n00       T = n11 + n10 + n01 + n00

a  2 x 2 contingency table in terms of counts. The nij values in the table represent the counts in each
forecast-observation category, where i represents the forecast and j represents the observations. The
“.” symbols in the Total cells represent sums across categories.
b  The counts n11, n10, n01, and n00 are sometimes called the “hits”, “false alarms”, “misses”, and
“correct rejections”, respectively.
c  By dividing the counts in the cells by the overall total, T, the joint proportions p11, p10, p01, and p00
can be computed. Note that p11 + p10 + p01 + p00 = 1. Similarly, if the counts are divided by the row
(column) totals, conditional proportions, based on the forecasts (observations) can be computed.
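
As a minimal sketch of how the counts in Table 7 arise, the snippet below tallies a short series of paired yes/no forecasts and observations. The function name `contingency_counts` and the sample series are hypothetical and serve only to illustrate the bookkeeping; they are not MET code or study data.

```python
def contingency_counts(forecast_events, observed_events):
    """Tally hits (n11), false alarms (n10), misses (n01), and correct rejections (n00)."""
    n11 = n10 = n01 = n00 = 0
    for f, o in zip(forecast_events, observed_events):
        if f and o:
            n11 += 1  # forecast yes, observed yes
        elif f:
            n10 += 1  # forecast yes, observed no
        elif o:
            n01 += 1  # forecast no, observed yes
        else:
            n00 += 1  # forecast no, observed no
    return n11, n10, n01, n00

# Made-up event series for a single grid point
fcst = [True, True, False, False, True, False]
obs = [True, False, True, False, True, False]
n11, n10, n01, n00 = contingency_counts(fcst, obs)
total = n11 + n10 + n01 + n00
joint = [n / total for n in (n11, n10, n01, n00)]  # p11, p10, p01, p00
print(n11, n10, n01, n00, sum(joint))
```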


The CSI score (Eq. 1) is computed as described in the MET User’s Guide 4.1
(NCAR 2013):

$$\mathrm{CSI} = \frac{n_{11}}{n_{11} + n_{10} + n_{01}} \tag{1}$$

with CSI being the ratio of the number of times the event was correctly forecasted
to occur to the number of times it was either forecasted or occurred. CSI ignores
the “correct rejections” category (i.e., n00).

The value of the CSI ranges between 0 and 1, with 1 being a perfect forecast and 0
being a forecast with no skill.
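
A short worked example may help fix the definition. The function below is an illustrative sketch, not MET code; the name `csi` and the sample counts are assumptions. Returning NaN when the event was neither forecast nor observed is one way to represent an undefined score.

```python
import math

def csi(n11, n10, n01):
    """Critical Success Index: hits / (hits + false alarms + misses)."""
    denom = n11 + n10 + n01
    # Undefined when the event was never forecast and never observed
    return n11 / denom if denom > 0 else math.nan

print(csi(2, 1, 1))  # 2 hits, 1 false alarm, 1 miss
print(csi(5, 0, 0))  # every event forecast and observed together
print(csi(0, 0, 0))  # no events at all: undefined
```

Because the correct rejections never enter the ratio, adding more “no/no” pairs leaves the CSI unchanged.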

The FBIAS score is computed as described below in Eq. 2:

$$\mathrm{Bias} = \frac{n_{11} + n_{10}}{n_{11} + n_{01}} = \frac{n_{1.}}{n_{.1}} \tag{2}$$

with FBIAS defined as the ratio of the total number of forecasts of an event to the
total number of observations of the event. A “good” value of frequency bias is close
to 1; a value greater than 1 indicates the event was forecasted too frequently and a
value less than 1 indicates the event was not forecasted frequently enough.
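
The interpretation of frequency bias can be illustrated with a few sample counts. As with the other sketches, the function name `fbias` and the numbers are hypothetical; NaN is used here as an assumed way to flag the case in which the event was never observed, where the ratio is undefined.

```python
import math

def fbias(n11, n10, n01):
    """Frequency bias: number of forecast events / number of observed events."""
    n_observed = n11 + n01
    # Undefined when the event was never observed
    return (n11 + n10) / n_observed if n_observed > 0 else math.nan

print(fbias(2, 1, 1))  # forecast 3 times, observed 3 times: no bias
print(fbias(3, 3, 0))  # forecast 6 times, observed 3 times: overforecasting
print(fbias(1, 0, 3))  # forecast once, observed 4 times: underforecasting
```

The first case shows that an unbiased forecast is not necessarily an accurate one: with 2 hits, 1 false alarm, and 1 miss, the bias is 1 even though the false alarm and the miss do not coincide.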


4.1  Compare  CSI  and  FBIAS  for  the  4  Threshold  Values 

A  display  of  the  spatial  distribution  of  the  CSI  for  TMP  for  4  different  thresholds 
is  shown  in  Fig.  2.  The  plot  for  TMP  GE  290  shows  the  CSI  score  for  the  case  with 
the  highest  threshold  that  was  generated  for  the  previous  study  by  Raby  and  Cai 
(2016). Note that the white areas do not have a CSI score due to
nonoccurrences of the GE 290-K event. The other plots show how CSI changes
from generally lower CSI scores to higher scores as the threshold value is lowered.
Visually, this trend appears as a transition from cooler to warmer colors with dark
orange indicating a perfect CSI score of 1. At 275 K, the CSI over most of the
domain  is  near  perfect  with  slightly  lower  scores  over  mountainous  terrain  and  the 
Sea  of  Cortez.  This  trend  matches  the  expected  trend  as  described  by  Jolliffe  and 
Stephenson (2012).

Fig. 2  CSI for 2-m AGL TMP for 4 thresholds

A  display  of  the  spatial  distribution  of  the  FBIAS  for  TMP  for  4  different  thresholds 
is shown in Fig. 3.

Fig. 3  FBIAS for 2-m AGL TMP for 4 thresholds

The  plot  for  TMP  GE  290  shows  the  FBIAS  score  for  the  case  with  the  highest 
threshold  that  was  generated  for  the  previous  study  by  Raby  and  Cai  (2016).  Note 
the areas that are white in color do not have an FBIAS score due to nonoccurrences
of  the  GE  290-K  event.  The  other  plots  show  the  same  improving  trend  as  that 
observed  for  CSI  with  decreasing  bias  as  the  threshold  is  lowered.  Visually,  this 
trend  appears  as  a  transition  to  the  green  color  indicating  an  FBIAS  score  of  1  or 
no  bias.  Again,  this  trend  agrees  with  the  expected  trend  according  to  Jolliffe  and 
Stephenson  (2012).  The  WRE-N  at  the  lowest  threshold  performs  very  well  over 
almost  the  entire  domain  with  almost  no  bias. 

A  display  of  the  spatial  distribution  of  the  CSI  for  RH  for  4  different  thresholds  is 
shown in Fig. 4.

Fig. 4  CSI for 2-m AGL RH for 4 thresholds

The  plot  for  RH  GE  85%  shows  the  CSI  score  for  the  case  with  the  highest  threshold 
that  was  generated  for  the  previous  study  by  Raby  and  Cai  (2016).  Note  the  areas 
that  are  white  in  color  do  not  have  a  CSI  score  due  to  nonoccurrences  of  the  GE 
85%  event.  The  other  plots  show  how  CSI  increases  as  the  threshold  value  is 
lowered.  Visually,  this  trend  appears  as  a  transition  from  cooler  to  warmer  colors 
with  dark  orange  indicating  a  perfect  CSI  score  of  1.  At  40%,  the  CSI  over  most 
areas  of  the  domain  has  improved,  especially  over  the  ocean  and  to  a  lesser  extent 
over  land.  This  trend  matches  the  expected  trend  as  described  by  Jolliffe  and 
Stephenson  (2012). 

A  display  of  the  spatial  distribution  of  the  FBIAS  for  RH  for  4  different  thresholds 
is shown in Fig. 5.

Fig. 5  FBIAS for 2-m AGL RH for 4 thresholds

The  plot  for  RH  GE  85%  shows  the  FBIAS  score  for  the  case  with  the  highest 
threshold  that  was  generated  for  the  previous  study  by  Raby  and  Cai  (2016).  The 
other  plots  show  the  same  improving  trend  as  that  observed  for  CSI  with  decreasing 
bias  as  the  threshold  is  lowered.  Note  the  areas  that  are  white  in  color  do  not  have 
an  FBIAS  score  due  to  nonoccurrences  of  the  GE  85%  event  and,  to  a  lesser  extent, 
the  GE  70%  event.  Visually,  the  improving  trend  appears  as  a  transition  to  the  green 
color  indicating  an  FBIAS  score  of  1  or  no  bias.  Again,  this  trend  agrees  with  the 
expected  trend  according  to  Jolliffe  and  Stephenson  (2012).  The  WRE-N  at  the 
lowest  threshold  performs  very  well  over  a  significant  portion  of  the  entire  domain 
with  almost  no  bias.  The  areas  where  there  is  an  overforecasting  bias  appear  to  be 
those  with  lower  elevation  over  land,  the  Salton  Sea,  the  Sea  of  Cortez,  and  over 
the  ocean  in  some  parts  of  the  coastal  zone. 

A  display  of  the  spatial  distribution  of  the  CSI  for  WIND  for  4  different  thresholds 
is shown in Fig. 6.

Fig. 6  CSI for 10-m AGL WIND for 4 thresholds

The plot for WIND GE 11 m/s shows the CSI score for the case with the highest
threshold that was generated for the previous study by Raby and Cai (2016). The
other plots show how CSI increases as the threshold value is lowered. Note the
areas that are white in color do not have a CSI score due to nonoccurrences of
the GE 11-m/s event and, to a lesser extent, the GE 8-m/s event. Visually, the
improving trend appears as a transition from cooler to warmer colors with dark
orange indicating a perfect CSI score of 1. At 2 m/s, the CSI over most areas of the
domain  has  improved,  especially  over  the  ocean  and  the  Sea  of  Cortez.  This  trend 
matches  the  expected  trend  as  described  by  Jolliffe  and  Stephenson  (2012). 

A  display  of  the  spatial  distribution  of  the  FBIAS  for  WIND  for  4  different 
thresholds is shown in Fig. 7.

Fig. 7  FBIAS for 10-m AGL WIND for 4 thresholds

The plot for WIND GE 11 m/s shows the FBIAS score for the case with the highest
threshold  that  was  generated  for  the  previous  study  by  Raby  and  Cai  (2016).  The 
other  plots  show  the  same  improving  trend  as  that  observed  for  CSI  with  decreasing 
bias  as  the  threshold  is  lowered.  Visually,  this  trend  appears  as  a  transition  to  the 
green  color  indicating  an  FBIAS  score  of  1  or  no  bias.  Again,  this  trend  agrees  with 
the  expected  trend  according  to  Jolliffe  and  Stephenson  (2012).  Note  there  are 
extensive  areas  of  white  indicating  no  occurrence  of  events  defined  by  all  4 
thresholds.  Reducing  the  threshold  resulted  in  a  reduction  of  these  nonevents.  At 
the  GE  2-m/s  threshold,  the  remaining  white  areas  are  due  to  the  nonoccurrence  of 
observed  winds  that  were  GE  2  m/s,  resulting  in  the  FBIAS  score  being  undefined. 
The  WRE-N  at  the  lowest  threshold  performs  very  well  over  a  significant  portion 
of  that  entire  domain  with  almost  no  bias.  The  areas  where  there  is  an 
overforecasting  bias  appear  to  be  mostly  over  land. 
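
The way a GE threshold turns paired wind speeds into categorical events, and why
a score can be undefined at a grid point, can be sketched as follows. This is a
minimal illustration in Python and is not the software used in the study; the
function names and the wind-speed values are hypothetical, and an undefined score
is represented here by None.

```python
def contingency(forecasts, observations, threshold):
    """Count hits, misses, and false alarms for the event 'value GE threshold'."""
    hits = misses = false_alarms = 0
    for f, o in zip(forecasts, observations):
        f_event, o_event = f >= threshold, o >= threshold
        if f_event and o_event:
            hits += 1
        elif o_event:
            misses += 1
        elif f_event:
            false_alarms += 1
    return hits, misses, false_alarms


def csi(hits, misses, false_alarms):
    """Critical Success Index; None when the event was neither forecast nor observed."""
    total = hits + misses + false_alarms
    return hits / total if total else None


def fbias(hits, misses, false_alarms):
    """Frequency bias (forecast events / observed events); None when none were observed."""
    observed = hits + misses
    return (hits + false_alarms) / observed if observed else None


# Hypothetical 10-m wind speeds (m/s) at one grid point over six valid times.
fcst = [1.5, 3.0, 6.0, 9.0, 4.0, 2.5]
obs = [2.0, 2.5, 4.5, 7.0, 5.5, 1.0]

scores = {}
for thr in (2, 5, 8, 11):
    counts = contingency(fcst, obs, thr)
    scores[thr] = (csi(*counts), fbias(*counts))
    print(f"GE {thr} m/s: counts={counts} CSI={scores[thr][0]} FBIAS={scores[thr][1]}")
```

With these made-up values, FBIAS is undefined at GE 8 m/s because the event was
forecast once but never observed, and both scores are undefined at GE 11 m/s
because the event was neither forecast nor observed.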


4.2 Summary of the Comparison of Scores for the 4 Threshold Values

The frequency of occurrence of forecast events determined by the application of
thresholds to a continuous variable field changes spatially over the domain,
affecting the CSI and FBIAS scores in a way that may give a misleading assessment
of the model's ability to forecast objects. Analysis of these scores for a range of
thresholds shows the WRE-N performs as expected, with better scores achieved
using lower threshold values.

When the thresholds were at the high end of the full range or, in some cases, the
middle and lower segments of the range of the variable, there were areas where no
events occurred, which limited the area over which scores could be calculated.
Analysis of more categorical scores and contingency-table statistics, as well as
assessment using object-based methods, is needed to overcome this limitation and
improve assessments of the ability of the model to forecast objects defined using
higher threshold values. Improved assessments of this aspect of model performance
will lead to model improvements that enable better prediction of objects rendered
using higher thresholds, which will, in turn, translate into better MyWIDA
unfavorable impact predictions.
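
The shrinking scored area at higher thresholds can be illustrated with a short
sketch. It assumes, as described for FBIAS in Section 4.1, that a grid point can
be scored only if the event was observed there at least once; the function name
and the observed values are hypothetical.

```python
def defined_fraction(obs_by_point, threshold):
    """Fraction of grid points where the event 'value GE threshold' was observed at least once."""
    defined = sum(1 for series in obs_by_point if any(v >= threshold for v in series))
    return defined / len(obs_by_point)


# Hypothetical observed wind speeds (m/s): one short time series per grid point.
obs_by_point = [
    [1.0, 1.5, 1.8],     # sheltered point: no event even at GE 2 m/s
    [2.5, 3.0, 4.0],
    [4.0, 6.0, 5.5],
    [7.0, 9.0, 8.5],
    [10.0, 12.0, 11.5],  # exposed point: events at every threshold
]

coverage = {thr: defined_fraction(obs_by_point, thr) for thr in (2, 5, 8, 11)}
for thr, frac in coverage.items():
    print(f"GE {thr} m/s: FBIAS defined at {frac:.0%} of grid points")
```

In this toy domain the scored fraction drops steadily as the threshold rises,
which mirrors the growth of the white areas in the plots.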

The  accuracy  of  the  model  judged  from  the  scores  varies  considerably  over  the 
domain  due  to  a  combination  of  terrain  characteristics  and  mesoscale  variations  in 
the  air-mass  characteristics.  This  is  true  of  scores  produced  for  all  thresholds. 
Analysis  of  more  scores  and  contingency-table  statistics  is  needed  to  better  relate 
them  to  terrain  and  air-mass  characteristics.  Use  of  a  Geographic  Information 
System  (GIS)  may  be  particularly  useful  for  more  in-depth  error  analysis  based  on 
domain partitioning. This variability suggests that weather
impacts on Army systems and missions vary considerably in space.
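
Before moving to a full GIS, the idea of domain partitioning could be prototyped
by grouping per-grid-point scores by a terrain label. The sketch below is only an
illustration under that assumption; the function name, the labels, and the CSI
values are hypothetical and are not results from this study.

```python
from collections import defaultdict


def mean_score_by_class(scores, classes):
    """Average the defined (non-None) scores within each terrain class."""
    grouped = defaultdict(list)
    for score, cls in zip(scores, classes):
        if score is not None:  # skip grid points where the score is undefined
            grouped[cls].append(score)
    return {cls: sum(vals) / len(vals) for cls, vals in grouped.items()}


# Hypothetical per-grid-point CSI values and matching terrain labels.
csi_values = [0.9, 0.8, None, 0.4, 0.5, 0.3]
terrain = ["ocean", "ocean", "ocean", "mountain", "valley", "mountain"]

summary = mean_score_by_class(csi_values, terrain)
for cls, mean in summary.items():
    print(f"{cls}: mean CSI = {mean:.2f}")
```

Undefined scores are excluded rather than counted as zero, so a class average
reflects only the grid points where the event occurred.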

The accuracy of the model at higher thresholds, judging from these results, is not
as good as that at lower thresholds. The implication of this apparent lack of skill
at higher thresholds is that the predictions of unfavorable weather impacts
generated by the MyWIDA TDA may not be as accurate as desired. However, for
marginal weather impacts, which are associated with somewhat lower threshold
values, the skill of the model may be better based on these results. Thus, it is
important to
conduct  studies  that  use  the  actual  system  and  mission  thresholds  to  more 
accurately  assess  the  ability  of  the  model  to  predict  objects  that  are  meaningful  to 
the Army. That said, use of actual thresholds will significantly reduce the number
of locations and time periods for which the atmospheric conditions can provide data
sets with a range of variable values that encompasses the actual thresholds. The
impact of these 2 situations, each at odds with the other, has to be judged with the
understanding that meaningful conclusions about model performance can only
come from the analysis of large numbers of cases. So, there is a tradeoff between
analysis of 1) data sets for fewer cases where tactically significant thresholds can
be applied and 2) the more numerous data sets that were developed using thresholds
defined by using the actual ranges of the variables present over the domain. The
former presents challenges due to the lack of statistically significant numbers of
cases; the latter presents a challenge of limited application for assessment of the
ability of models to forecast objects using mission- and system-specific thresholds.
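
The effect of the number of cases on the reliability of a score can be illustrated
with a resampling sketch. The percentile bootstrap used here is one common way to
gauge sampling uncertainty and is an assumption of this example, not a method
taken from the study; the function names and the randomly generated contingency
counts are hypothetical.

```python
import random


def pooled_csi(cases):
    """CSI from counts summed over cases; each case is (hits, misses, false_alarms)."""
    hits = sum(h for h, _, _ in cases)
    total = sum(h + m + f for h, m, f in cases)
    return hits / total if total else None


def bootstrap_width(cases, n_resamples=500, seed=0):
    """Width of an approximate 90% percentile interval for the pooled CSI."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_resamples):
        resample = [rng.choice(cases) for _ in cases]  # draw cases with replacement
        value = pooled_csi(resample)
        if value is not None:
            values.append(value)
    values.sort()
    low = values[int(0.05 * len(values))]
    high = values[int(0.95 * len(values)) - 1]
    return high - low


def make_cases(n, seed):
    """Generate n hypothetical cases with random contingency counts."""
    rng = random.Random(seed)
    return [(rng.randint(0, 10), rng.randint(0, 5), rng.randint(0, 5)) for _ in range(n)]


few_cases = make_cases(8, seed=1)      # e.g., tactically significant thresholds
many_cases = make_cases(400, seed=1)   # e.g., thresholds drawn from observed ranges
width_few = bootstrap_width(few_cases)
width_many = bootstrap_width(many_cases)
print(f"90% interval width: 8 cases = {width_few:.3f}, 400 cases = {width_many:.3f}")
```

The interval for the small set of cases is much wider than for the large set,
which is the statistical side of the tradeoff described above.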

5.  Conclusion  and  Final  Comments 


The  author  found  that  the  CSI  and  FBIAS  skill  scores  produced  using  a 
spatial-categorical-verification  method  with  multiple  threshold  values  for  each  of 
the  studied  variables  improve  with  decreasing  threshold  value.  The  amount  of 
improvement  was  not  the  same  over  the  entire  domain,  however.  The  study  found 
that  the  frequency  of  occurrence  of  forecast  events  determined  by  the  application 
of  a  high  threshold  value  to  a  continuous  variable  field  varies  over  the  domain  and 
affects  the  CSI  and  FBIAS  scores  in  a  way  that  may  give  a  misleading  assessment 
of  the  model’s  ability  to  forecast  objects.  Thresholds  that  define  objects  at  the  high 
end  and,  to  a  lesser  extent,  the  mid-  and  lower  portions  of  the  range  of  a  variable 
will  produce  scores  over  a  subset  of  the  domain  because  in  some  parts  of  the  domain 
there  were  no  event  occurrences.  This  restricts  the  scoring  to  those  areas  where 
events occurred. As the threshold decreases, the number of nonevents decreases,
allowing  scores  to  be  calculated  over  more  of  the  domain.  To  more  accurately 
assess  the  ability  of  the  model  to  predict  objects  defined  by  high  thresholds,  studies 
are  needed  that  use  additional  scores  and  statistics  that  are  possible  with  the 
spatial-categorical  method.  Further,  object-based  methods  provide  additional 
information  about  the  ability  to  predict  objects.  Raby  and  Cai  (2016)  recommended 
a  more  comprehensive  approach  combining  several  traditional  and  nontraditional 
methods  for  assessing  the  ability  of  the  model  to  predict  objects  defined  by 
thresholds;  these  numerous  scores  and  statistics,  when  analyzed  together,  may 
reveal  more  information  about  model  performance. 

Another  difficulty  that  arises  when  using  high  threshold  values  was  discussed  by 
the  author  (Raby  2016).  The  CSI  and  FBIAS  scores  presented  in  this  report  were 
reviewed by Cai (2016), who suggested that the lack of skill at high thresholds may
stem from the reduced size of the objects defined by the high threshold, which
leads to increases in model displacement errors. Raby (2016) presented results from
object  analysis  at  multiple  thresholds  showing  the  objects  defined  at  low  thresholds 
were  larger  than  objects  defined  at  high  thresholds.  For  a  given  model  displacement 
error,  the  resulting  CSI  scores  indicate  lower  skill  when  the  objects  are  small  and 
indicate  higher  skill  when  the  objects  are  larger.  To  illustrate  this  difference  in 
scores,  Fig.  8  depicts  large  and  small  objects  and  a  given  displacement  error. 



Fig.  8  Object-displacement  error  for  large  and  small  objects 


The  CSI  is  calculated  from  contingency-table  statistics  and  is  the  ratio  of  the 
number  of  hits  to  the  sum  of  the  hits,  false  alarms,  and  misses.  Figure  8  shows  there 
is  considerable  agreement  for  the  large  objects  despite  the  horizontal  (east-west) 
displacement error. For the small objects, the same displacement leaves no
agreement, and there is the potential for more misses and fewer hits, especially if
there are numerous small objects. The displacement error of small objects results
in a significant decrease in the number of hits and an increase in the number of
misses, which lowers the CSI. By comparison, the same displacement error of large
objects still results in a significant number of hits and fewer misses, which
raises the CSI.
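
The effect of object size on CSI for a fixed displacement can be checked with a
small calculation. The sketch below uses a single row of grid cells and assumes
the forecast object is simply the observed object shifted by a fixed number of
cells; the function name and the object sizes are hypothetical and are not taken
from Fig. 8.

```python
def csi_for_shifted_object(width, shift, start=5):
    """CSI on a 1-D row of grid cells when the forecast object is the observed
    object displaced by `shift` cells."""
    observed = set(range(start, start + width))
    forecast = set(range(start + shift, start + shift + width))
    hits = len(observed & forecast)          # cells in both objects
    misses = len(observed - forecast)        # observed but not forecast
    false_alarms = len(forecast - observed)  # forecast but not observed
    return hits / (hits + misses + false_alarms)


large = csi_for_shifted_object(width=20, shift=4)
small = csi_for_shifted_object(width=4, shift=4)
print(f"large object CSI = {large:.2f}, small object CSI = {small:.2f}")
```

With a 4-cell displacement, the 20-cell object keeps 16 overlapping cells (CSI of
16/24), whereas the 4-cell object has no overlap and a CSI of 0.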

To  further  improve  assessments  of  the  predictability  of  objects,  Raby  and  Cai 
(2016)  recommended  a  more  rigorous  approach  that  requires  the  generation  of 
larger  data  sets  of  forecast  output  and  gridded  observations  so  that  statistically 
significant  results  can  be  obtained.  This  will  be  important  when  verifying  the 
modeled  objects  defined  at  higher  thresholds,  particularly  when  WRE-N  model 
output is used to predict the more critical unfavorable-weather impacts on Army
systems  and  missions  using  MyWIDA. 

Finally, to analyze and understand the complexity of the spatial variability of the
scores revealed by this study and the previous study (Raby and Cai 2016), a GIS,
which the atmospheric sciences have not extensively used, should be exploited for
its ability to contextualize and analyze geospatial information such as terrain
type/slope, land-use effects, and other spatial and temporal variables as explanatory
metrics in model assessments (Smith et al. 2015, 2016a, 2016b). This technique has
considerable promise of becoming an important new tool to augment other
traditional and nontraditional tools for a comprehensive approach to model
verification.

List of Symbols, Abbreviations, and Acronyms


ACARS     Aircraft Communications, Addressing, and Reporting System
AGL       above ground level
ARL       US Army Research Laboratory
ARW       Advanced Research Weather Research and Forecasting model
CSI       Critical Success Index
DPT       dew-point temperature
FBIAS     Frequency Bias
FDDA      Four-Dimensional Data Assimilation
GE        greater than or equal to
GFS       Global Forecast System
GIS       Geographic Information System
GRIB      Gridded Binary format, edition 1
GRIB2     Gridded Binary format, edition 2
GSD       Global Systems Division
LAPS      Local Analysis and Prediction System
MADIS     Meteorological Assimilation Data Ingest System
MET       Model Evaluation Tools
MYJ       Mellor-Yamada-Janjic
MyWIDA    My Weather Impacts Decision Aid
NCAR      National Center for Atmospheric Research
NetCDF    Network Common Data Form
NOAA      National Oceanic and Atmospheric Administration
NWP       Numerical Weather Prediction
PBL       Planetary Boundary Layer
RH        relative humidity
RRTM      Rapid Radiative Transfer Model
RTMA      Real-Time Mesoscale Analysis
TAMDAR    Tropospheric Airborne Meteorological Data Reporting
TDA       Tactical Decision Aid
TMP       temperature
UTC       Coordinated Universal Time
WIND      wind speed
WRE-N     Weather Running Estimate-Nowcast
WRF       Weather Research and Forecasting
WRF-ARW   Weather Research and Forecasting, Advanced Research WRF